This is an updated version from the original 2017 publication. Additional trawl metadata, such as the coordinates of each trawl, have been added and incorporated into the relevant data tables.

2 Plastic concentrations

2.1 Importing and Managing Count Data and Trawl Metadata

We currently have data for 38 stations and 108 trawls.

Note that the NK 0009-2 106-1000um fragment concentration is an order of magnitude larger than all others; this is an unreliable count on an anomalously difficult sample, so we remove this concentration from the data set.

2.1.1 Trawl metadata

## [1] 2268   11
## [1] 1540   11

2.1.1.1 Trawl totals across all size fractions

gl.2014.plastics.tot.conc <- 
  as.data.frame(gl.2014.plastics.conc.df[,c('trawl', 
                              'station', 
                              'size_fraction_um', 
                              'total_conc_km2')])

gl.2014.plastics.tot.conc.nona <- 
  na.omit(gl.2014.plastics.tot.conc)

gl.2014.plastics.tot.wide <-
  dcast(data = gl.2014.plastics.tot.conc.nona,
        formula = trawl + station ~ size_fraction_um,
        value.var = 'total_conc_km2',
        fill = NA_real_)

gl.2014.plastics.tot.wide.complete.only <-
  as.data.frame(na.omit(gl.2014.plastics.tot.wide))

gl.2014.plastics.tot.wide.complete.only$total_concentration <- 
  gl.2014.plastics.tot.wide.complete.only$'106-1000' + 
  gl.2014.plastics.tot.wide.complete.only$'1000-4750' +
  gl.2014.plastics.tot.wide.complete.only$'>4750'

gl.2014.plastics.tot.wide.all.data <-
  merge(gl.2014.plastics.tot.wide.complete.only, 
        gl.2014.trawl.data, 
        by = c("trawl","station"), 
        all.x = TRUE, 
        suffixes = c(".x", ".y"))

gl.2014.plastics.tot.wide.all.data.colnames <- c(colnames(gl.2014.plastics.tot.wide.all.data))

gl.2014.plastics.tot.wide.all.data <-
gl.2014.plastics.tot.wide.all.data[ , gl.2014.plastics.tot.wide.all.data.colnames]

# gl.2014.plastics.tot.wide.all.data <-
# gl.2014.plastics.tot.wide.all.data[,c("trawl", 
#   "station",
#   "water_body",
#   "location",
#   "station_class",
#   "106-1000",
#   "1000-4750",
#   ">4750",
#   "total_concentration",
#   "trawl_date",
#   "trawl_start_time",
#   "flowm_dist_m",
#   "flowm_area_km2",
#   "wet_mass",
#   "dry_mass",
#   "air_vel_avg_kt",
#   "cloud_cover_avg",
#   "air_temp_avg_C",
#   "water_temp_avg_C",
#   "E_wtr_vel_surf_avg_kt",
#   "N_wtr_vel_surf_avg_kt",
#   "wv_drctn_avg",
#   "wv_prd_avg",
#   "sig_wv_ht_avg_m")]

datatable(gl.2014.plastics.tot.wide.all.data,
          options = list(pageLength = 25),
          caption = "Total of all fractions for complete trawls (km-2)")

2.1.2 Station metadata

##Exploring Data

2.1.3 Count Data Shape

A simple histogram of all total concentrations by size fraction was created to establish an appreciation for shape of the data.

Repeat the same for the component concentrations.

The 106-1000 µm fraction is pretty evenly distributed (likely because data are more sparse–due to labor intensiveness, not all trawls were counted) and the concentrations for the larger two fractions are heavily postitively skewed (toward the zeroes and smaller numbers). We can also see this in the skewness (very positive) and kurtosis (high positive numbers) for the total concentrations.

2.1.4 Testing Normality

Based on visual inspection, these data do not appear to have a normal distribution. We next test this hypothesis of non-normality with the Shapiro-Wilks test: are the concentrations non-normally distributed? (smaller p-value, higher probability of being non-normal)

##           conc_type concentration
## 1 fragment_conc_km2  1.308586e-28
## 2     line_conc_km2  8.529583e-25
## 3     film_conc_km2  5.179129e-25
## 4     foam_conc_km2  3.103382e-28
## 5   nurdle_conc_km2  1.200617e-30
## 6   sphere_conc_km2  2.406124e-29
## 7    paint_conc_km2  3.243382e-27
## 8    total_conc_km2  4.203199e-28
## 9    fiber_conc_km2  2.584405e-29
##   size_fraction_um concentration
## 1         106-1000  1.068027e-10
## 2        1000-4750  1.020828e-52
## 3            >4750  6.743893e-55
##           conc_type mean_concentration
## 1 fragment_conc_km2       4.048243e-17
## 2     film_conc_km2       1.111000e-13
## 3     line_conc_km2       1.399332e-12
## 4   nurdle_conc_km2       4.725135e-18
## 5   sphere_conc_km2       1.127956e-16
## 6     foam_conc_km2       5.856112e-16
## 7    paint_conc_km2       1.866749e-14
## 8    total_conc_km2       7.522970e-17
## 9    fiber_conc_km2       1.578691e-16
##   size_fraction_um mean_concentration
## 1         106-1000       6.083180e-09
## 2        1000-4750       1.047869e-33
## 3            >4750       2.569836e-36

According to the Shapiro Wilks test on all component trawl concentrations and total concentrations, the distributions vary significantly from normal (p<<<0.5). The same result is observed with station concentrations, although the p-values are higher for station means.

Conclusion: None of these datasets are normally distributed, so parametric tests are not an option without data transformations.

2.1.5 Centrality by Mean or Median?

Based on the data shape, we next ask: what best describes the centrality of these data: mean or median of concentrations?

Mean (red dashed line) and median (blue dotted line) for the total concentration histograms were plotted.

## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).

Mean appears to represent the data better than the median, because although it doesn’t fall where the majority of the data are, it falls more central to the full range of data.

Representing the spread of the data, however, is going to be more difficult because the data are:
* so postitively skewed that the standard deviation will be very large
* limited by zero on the lower end but not limited on the upper end

A quick summary of the total concentrations also exemplifies this issue, and how it varies across size fractions. The mean concentration for each fraction is suprisingly close to (and for the smallest fraction, greater than) the upper quartile (75th percentile) value. This means ~75% of the concentrations in each size fraction are below the mean value.

The Q-Q plot below shows that the concentrations in the 1000-4750 um size class deviate from linearity and are thus non-normally distributed–with skew particularly evident at higher concentrations.

The above spread can also be represented as a boxplot (red point is mean value). Note that the means are higher than the median (central line of boxplot), and for the mid and large size class are greater than the 75% percentile (top of box).

2.2 Data by trawl (each trawl treated individually)

2.2.1 Summary tables by size fraction

## Using Freq as value column: use value.var to override.
table.gl.2014.conc.by.size.comp.sum <-
  aggregate(data=table.gl.2014.conc.by.size.comp.melt.nona,
            concentration~size_fraction_um+conc_type,
            sum)
 
table.gl.2014.conc.by.size.comp.sum.cast <-
  dcast(data = table.gl.2014.conc.by.size.comp.sum,
        formula = size_fraction_um ~ conc_type,
        fill = NA,
        value.var = "concentration")

table.gl.2014.conc.totals.by.size.comp.prop <-
  cbind(table.gl.2014.conc.by.size.comp.sum.cast$'size_fraction_um', 
        table.gl.2014.conc.by.size.comp.sum.cast[,2:8]/table.gl.2014.conc.by.size.comp.sum.cast$'total_conc_km2')

conc.type.col.names <- c("Size (µm)", 
                         "Fragment", 
                         "Film", 
                         "Line", 
                         "Nurdle", 
                         "Sphere", 
                         "Foam", 
                         "Paint")

colnames(table.gl.2014.conc.totals.by.size.comp.prop) <- conc.type.col.names

table.gl.2014.conc.totals.by.size.comp.prop$n <- 
  trawl.prop.conc.size.count$total_conc_km2

# kable(
#   table.gl.2014.conc.totals.by.size.comp.prop, 
#   digits = 3, 
#   caption = "Proportion of summed individual trawl concentration")

xtable.prop.conc.types <- 
  xtable(table.gl.2014.conc.totals.by.size.comp.prop, 
         digits = 3, 
         caption = "Proportion of individual trawl concentration")

prop.conc.data <- 
  write.csv(
    table.gl.2014.conc.totals.by.size.comp.prop, 
    file = "./outfiles_final/final_proportion_total_conc_by_size_table.csv", 
    na = "NA")

table.gl.2014.conc.totals.by.size.comp.prop.round <- 
  cbind('Size (µm)' = table.gl.2014.conc.totals.by.size.comp.prop[,1], 
        round(table.gl.2014.conc.totals.by.size.comp.prop[,-1],3))

datatable(table.gl.2014.conc.totals.by.size.comp.prop.round,
          options = list(pageLength = 3),
          caption = "Proportion of individual trawl concentration"
          )
## Using Freq as value column: use value.var to override.
## Using Freq as value column: use value.var to override.

2.2.2 Figures per trawl by size

## Warning: Removed 1862 rows containing non-finite values (stat_boxplot).

## Error in percent_format(): could not find function "percent_format"
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1134 rows containing non-finite values (stat_boxplot).

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 150 rows containing non-finite values (stat_boxplot).
## Warning: Removed 80 rows containing missing values (geom_point).

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 150 rows containing non-finite values (stat_boxplot).
## Warning: Removed 80 rows containing missing values (geom_point).

2.2.3 Summary tables by station type

## Using Freq as value column: use value.var to override.
## Using Freq as value column: use value.var to override.
## Using Freq as value column: use value.var to override.

2.3 Data by station (mean of trawls at each station)

2.3.1 Summary tables by size

2.3.1.1 Mean-normalized station range plots

## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

## Warning: Removed 86 rows containing non-finite values (stat_boxplot).
## Warning: Removed 86 rows containing missing values (geom_point).

## Warning: Removed 86 rows containing non-finite values (stat_boxplot).
## Warning: Removed 86 rows containing missing values (geom_point).

## Warning: Ignoring unknown aesthetics: colors
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 64 rows containing non-finite values (stat_smooth).
## Warning: Removed 30 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = MNR ~ log10(total_conc_km2 + 1), data = gl.2014.trawl.conc.sum.data, 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50917 -0.53223  0.02993  0.34934  1.55651 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.73683    0.10775   25.40   <2e-16 ***
## log10(total_conc_km2 + 1) -0.35670    0.02794  -12.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6953 on 206 degrees of freedom
##   (116 observations deleted due to missingness)
## Multiple R-squared:  0.4418, Adjusted R-squared:  0.4391 
## F-statistic:   163 on 1 and 206 DF,  p-value: < 2.2e-16
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 64 rows containing non-finite values (stat_smooth).
## Warning: Removed 30 rows containing missing values (geom_point).

## Warning: Removed 13 rows containing missing values (geom_smooth).

## Using Freq as value column: use value.var to override.
table.gl.2014.stn.conc.by.size.comp.sum <-
  aggregate(data = table.gl.2014.stn.conc.by.size.comp.long.nona,
            mean_concentration~size_fraction_um+conc_type,
            sum, 
            na.action = na.exclude)
 
table.gl.2014.stn.conc.by.size.comp.sum.cast <-
  dcast(table.gl.2014.stn.conc.by.size.comp.sum,
        size_fraction_um~conc_type,
        value.var = "mean_concentration",
        fill = NA_real_)

table.gl.2014.stn.conc.totals.by.size.comp.prop <-
  cbind(table.gl.2014.stn.conc.by.size.comp.sum.cast$'size_fraction_um', 
        table.gl.2014.stn.conc.by.size.comp.sum.cast[,2:8]/table.gl.2014.stn.conc.by.size.comp.sum.cast$'total_conc_km2')

conc.type.col.names <- c("Size (µm)", 
                         "Fragment", 
                         "Film", 
                         "Line", 
                         "Nurdle", 
                         "Sphere", 
                         "Foam", 
                         "Paint")

colnames(table.gl.2014.stn.conc.totals.by.size.comp.prop) <- 
  conc.type.col.names

table.gl.2014.stn.conc.totals.by.size.comp.prop$n <- 
  stn.prop.conc.size.count$total_conc_km2

xtable.prop.stn.conc.types <- 
  xtable(table.gl.2014.stn.conc.totals.by.size.comp.prop, 
         digits = 3, 
         caption = "Proportion of mean station concentration")

prop.stn.conc.data <- 
  write.csv(
    table.gl.2014.stn.conc.totals.by.size.comp.prop, 
    file = "./outfiles_final/final_proportion_mean_stn_conc_by_size_table.csv", 
    na = "NA")

table.gl.2014.stn.conc.totals.by.size.comp.prop.round <- 
  cbind('Size (µm)' = table.gl.2014.stn.conc.totals.by.size.comp.prop[,1], 
        round(table.gl.2014.stn.conc.totals.by.size.comp.prop[,-1],3))

datatable(table.gl.2014.stn.conc.totals.by.size.comp.prop.round, 
          options = list(pageLength = 3),
          caption = "Proportion of mean station concentration"
          )
## Using Freq as value column: use value.var to override.
## Using Freq as value column: use value.var to override.

2.3.1.2 CoV station range plots

## Warning: Removed 29 rows containing non-finite values (stat_boxplot).

## Warning: Ignoring unknown aesthetics: colors
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = CoV ~ log10(total_conc_km2 + 1), data = gl.2014.trawl.conc.sum.data, 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86851 -0.25708 -0.01838  0.22921  0.85492 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.50379    0.06068   24.78   <2e-16 ***
## log10(total_conc_km2 + 1) -0.18428    0.01644  -11.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3838 on 190 degrees of freedom
##   (132 observations deleted due to missingness)
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.3949 
## F-statistic: 125.6 on 1 and 190 DF,  p-value: < 2.2e-16
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 11 rows containing missing values (geom_smooth).

2.3.2 Summary tables by station type

The highest single trawl concentration is 1.887951310^{6} km^-2, found in the 106-1000 µm size fraction at the Detroit WWTP of Detroit Rv..

## Using Freq as value column: use value.var to override.
## Using Freq as value column: use value.var to override.
## Using Freq as value column: use value.var to override.

3 Blank counts

Blank Size Fragment Fiber Total 1 1 1000-4750 0 32 32 2 2 1000-4750 0 33 33 3 3 1000-4750 0 8 8 4 1 100-1000 0 7 7 5 2 100-1000 1 19 20 6 3 100-1000 0 26 26 7 1 >4750 0 0 0 8 2 >4750 0 0 0 9 3 >4750 0 0 0

4 EDS data

## Warning in chisq.test(eds.calls.by.suspect): Chi-squared approximation may
## be incorrect
## [1] 2.00255e-23
## NULL
## [1] 1.152347e-27
## [1] "two.sided"

I created a contingency table of a piece’s suspected type vs. type determined from spectra:

##              final_call
## suspected      M M-P NP NP-P  P
##   not plastic 35   1 46   11  7
##   plastic     10   0  2   12 76

So we can see that 76% of all pieces visually identified as plastic were confirmed as plastic, while 12% couldn’t be positively identified as plastic or not plastic by EDS. 81% of pieces were confirmed as not plastic, while only 7% of pieces were incorrectly identified as non-plastic.

To test these percentages for significance, I performed a Chi-square test of independence on the contingency table:

## 
##  Pearson's Chi-squared test
## 
## data:  eds.calls.by.suspect
## X-squared = 112.63, df = 4, p-value < 2.2e-16

It looks as if the p-value is very low, indicating that the data vary significantly from an even distribution, both between call types and between suspected type, but there is a warning that the results may be incorrect, probably because some expected values are very close to zero.

##              final_call
## suspected        M M-P NP NP-P    P
##   not plastic 22.5 0.5 24 11.5 41.5
##   plastic     22.5 0.5 24 11.5 41.5

But when you look at the residuals of the test, it looks as if NP and M calls occur more often in the nonplastic category, and the P and P-NP calls occur more often with the plastic category - yay!

##              final_call
## suspected              M        M-P         NP       NP-P          P
##   not plastic  2.6352314  0.7071068  4.4907312 -0.1474420 -5.3554386
##   plastic     -2.6352314 -0.7071068 -4.4907312  0.1474420  5.3554386

That was kind of fun, so I took a look at what effect the spectrum reader might have:

##                           final_call
## reader                      M M-P NP NP-P  P
##   andrew                    7   0  4    0  9
##   andrew, brendan           7   1  3    1  8
##   brendan                  21   0 33   14 52
##   kaitie                    8   0  2    5  5
##   rachel, brendan, melissa  2   0  6    3  9
## Warning in chisq.test(eds.calls.by.reader): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  eds.calls.by.reader
## X-squared = 28.535, df = 16, p-value = 0.02727
##                           final_call
## reader                        M M-P   NP NP-P    P
##   andrew                    4.5 0.1  4.8  2.3  8.3
##   andrew, brendan           4.5 0.1  4.8  2.3  8.3
##   brendan                  27.0 0.6 28.8 13.8 49.8
##   kaitie                    4.5 0.1  4.8  2.3  8.3
##   rachel, brendan, melissa  4.5 0.1  4.8  2.3  8.3
##                           final_call
## reader                               M         M-P          NP        NP-P
##   andrew                    1.17851130 -0.31622777 -0.36514837 -1.51657509
##   andrew, brendan           1.17851130  2.84604989 -0.82158384 -0.85719462
##   brendan                  -1.15470054 -0.77459667  0.78262379  0.05383819
##   kaitie                    1.64991582 -0.31622777 -1.27801930  1.78032728
##   rachel, brendan, melissa -1.17851130 -0.31622777  0.54772256  0.46156633
##                           final_call
## reader                               P
##   andrew                    0.24297355
##   andrew, brendan          -0.10413152
##   brendan                   0.31175111
##   kaitie                   -1.14544672
##   rachel, brendan, melissa  0.24297355

It looks as if that pesky M-P call might be screwing with the chi-square’s ability to compare distributions, so it might be worth taking out of the dataset completely (I had an internal battle as to whether it should be left in for future comparisons).

## Warning: Ignoring unknown aesthetics: countour